Clustering Web Documents based on Efficient Multi-Tire Hashing Algorithm for Mining Frequent Termsets

نویسندگان

  • Noha Negm
  • Mohamed Amin
  • Passent Elkafrawy
  • Abdel Badeeh M. Salem
چکیده

Document Clustering is one of the main themes in text mining. It refers to the process of grouping documents with similar contents or topics into clusters to improve both availability and reliability of text mining applications. Some of the recent algorithms address the problem of high dimensionality of the text by using frequent termsets for clustering. Although the drawbacks of the Apriori algorithm, it still the basic algorithm for mining frequent termsets. This paper presents an approach for Clustering Web Documents based on Hashing algorithm for mining Frequent Termsets (CWDHFT). It introduces an efficient Multi-Tire Hashing algorithm for mining Frequent Termsets (MTHFT) instead of Apriori algorithm. The algorithm uses new methodology for generating frequent termsets by building the multi-tire hash table during the scanning process of documents only one time. To avoid hash collision, Multi Tire technique is utilized in this proposed hashing algorithm. Based on the generated frequent termset the documents are partitioned and the clustering occurs by grouping the partitions through the descriptive keywords. By using MTHFT algorithm, the scanning cost and computational cost is improved moreover the performance is considerably increased and increase up the clustering process. The CWDHFT approach improved accuracy, scalability and efficiency when compared with existing clustering algorithms like Bisecting K-means and FIHC. Keywords—Document Clustering; Knowledge Discovery; Hashing; Frequent termsets; Apriori algorithm; Text Documents; Text Mining; Data Mining

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Investigate the Performance of Document Clustering Approach Based on Association Rules Mining

The challenges of the standard clustering methods and the weaknesses of Apriori algorithm in frequent termset clustering formulate the goal of our research. Based on Association Rules mining, an efficient approach for Web Document Clustering (ARWDC) has been devised. An efficient Multi-Tire Hashing Frequent Termsets algorithm (MTHFT) has been used to improve the efficiency of mining association...

متن کامل

An Efficient Hash-based Association Rule Mining Approach for Document Clustering

Document clustering is one of the important research issues in the field of text mining, where the documents are grouped without predefined categories or labels. High dimensionality is a major challenge in document clustering. Some of the recent algorithms address this problem by using frequent term sets for clustering. This paper proposes a new methodology for document clustering based on Asso...

متن کامل

Discovering Word Meanings Based on Frequent Termsets

Word meaning ambiguity has always been an important problem in information retrieval and extraction, as well as, text mining (documents clustering and classification). Knowledge discovery tasks such as automatic ontology building and maintenance would also profit from simple and efficient methods for discovering word meanings. The paper presents a novel text mining approach to discovering word ...

متن کامل

An Efficient Association Rule Mining Using the H-BIT Array Hashing Algorithm

Association Rule Mining (ARM) finds the interesting relationship between presences of various items in a given database. Apriori is the traditional algorithm for learning association rules. However, it is affected by number of database scan and higher generation of candidate itemsets. Each level of candidate itemsets requires separate memory locations. Hash Based Frequent Itemsets Quadratic Pro...

متن کامل

Performance Evaluation of an Efficient Frequent Item sets-Based Text Clustering Approach

The vast amount of textual information available in electronic form is growing at a staggering rate in recent times. The task of mining useful or interesting frequent itemsets (words/terms) from very large text databases that are formed as a result of the increasing number of textual data still seems to be a quite challenging task. A great deal of attention in research community has been receiv...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013